STAT 301 Group Project Final Report
Authors: Ellie, Yuxi, Leen, Macy
Introduction
A GitHub repository is an efficient tool for code management and collaboration, whether for personal learning, team development, or open-source projects. In recent years, with the widespread adoption of the internet and the rise of the big data industry, the demand for repositories on GitHub has grown significantly. Understanding which repository characteristics users value can help creators align with user expectations and follow industry trends, which in turn supports the long-term, sustainable development of the platform. In this context, one measure of a repository's success and popularity is the number of stars it accumulates on GitHub. Given GitHub's dominant position in the tech space, methods for predicting or increasing a repository's popularity are valuable to creators. This brings us to the following research questions.
Research questions:
- Which fundamental characteristics of a repository influence its popularity?
- Can these fundamental characteristics effectively predict the popularity of a repository?
Alignment with Existing Literature:
Previous research by Hudson Borges, Andre Hora, and Marco Tulio Valente (2016) in "Predicting the Popularity of GitHub Repositories" used multiple linear regression to analyze the factors influencing repository popularity. Similarly, in "Characterization and Prediction of Popular Projects on GitHub," Junxiao Han, Shuiguang Deng, Xin Xia, Dongjing Wang, and Jianwei Yin (2019) applied multiple linear regression to a different dataset. Despite the different data sources, both studies arrived at strikingly similar conclusions: the number of forks has a strong positive correlation with the number of stars, making it a significant predictor of repository popularity. In contrast, variables such as license type and repository creation time were found to have relatively minor effects on the popularity of GitHub repositories. In our report, we use a different dataset, explore various input variables, and employ alternative model selection methods to investigate whether other variables could influence the popularity of repositories. In this way, repository creators can get more comprehensive and accurate guidance on improving their repository's popularity.
Dataset:
To address our research questions, this study uses data from Kaggle. The data was collected through the GitHub Search API and contains information on over 215,000 of the top GitHub repositories, restricted to repositories with at least 167 stars. It includes the following 24 variables:
# Main developer: YUXI
variable_description <- data.frame(
Field = c("Name", "Description", "URL", "Created.At", "Updated.At", "Homepage",
"Size", "Stars", "Forks", "Issues", "Watchers", "Language", "License",
"Topics", "Has.Issues", "Has.Projects", "Has.Downloads", "Has.Wiki",
"Has.Pages", "Has.Discussions", "Is.Fork", "Is.Archived", "Is.Template",
"Default.Branch"),
Description = c("The name of the GitHub repository",
"A brief textual description that summarizes the purpose or focus of the repository",
"The URL or web address that links to the GitHub repository",
"The date and time when the repository was initially created on GitHub",
"The date and time of the most recent update or modification to the repository",
"The URL to the homepage or landing page associated with the repository",
"The size of the repository in bytes, indicating the total storage space used by the repository's files and data",
"The number of stars or likes that the repository has received from other GitHub users, indicating its popularity or interest",
"The number of times the repository has been forked by other GitHub users",
"The total number of open issues",
"The number of GitHub users who are 'watching' or monitoring the repository for updates and changes",
"The primary programming language",
"Information about the software license using a license identifier",
"A list of topics or tags associated with the repository, helping users discover related projects and topics of interest",
"A boolean value indicating whether the repository has an issue tracker enabled",
"A boolean value indicating whether the repository uses GitHub Projects to manage and organize tasks and work items",
"A boolean value indicating whether the repository offers downloadable files or assets to users",
"A boolean value indicating whether the repository has an associated wiki with additional documentation and information",
"A boolean value indicating whether the repository has GitHub Pages enabled, allowing the creation of a website associated with the repository",
"A boolean value indicating whether the repository has GitHub Discussions enabled, allowing community discussions and collaboration",
"A boolean value indicating whether the repository is a fork of another repository",
"A boolean value indicating whether the repository is archived. Archived repositories are typically read-only and are no longer actively maintained",
"A boolean value indicating whether the repository is set up as a template",
"The name of the default branch"),
stringsAsFactors = FALSE
)
cat("Table 1: Description of the variables in our dataset \n")
variable_description
Table 1: Description of the variables in our dataset
| Field | Description |
|---|---|
| <chr> | <chr> |
| Name | The name of the GitHub repository |
| Description | A brief textual description that summarizes the purpose or focus of the repository |
| URL | The URL or web address that links to the GitHub repository |
| Created.At | The date and time when the repository was initially created on GitHub |
| Updated.At | The date and time of the most recent update or modification to the repository |
| Homepage | The URL to the homepage or landing page associated with the repository |
| Size | The size of the repository in bytes, indicating the total storage space used by the repository's files and data |
| Stars | The number of stars or likes that the repository has received from other GitHub users, indicating its popularity or interest |
| Forks | The number of times the repository has been forked by other GitHub users |
| Issues | The total number of open issues |
| Watchers | The number of GitHub users who are 'watching' or monitoring the repository for updates and changes |
| Language | The primary programming language |
| License | Information about the software license using a license identifier |
| Topics | A list of topics or tags associated with the repository, helping users discover related projects and topics of interest |
| Has.Issues | A boolean value indicating whether the repository has an issue tracker enabled |
| Has.Projects | A boolean value indicating whether the repository uses GitHub Projects to manage and organize tasks and work items |
| Has.Downloads | A boolean value indicating whether the repository offers downloadable files or assets to users |
| Has.Wiki | A boolean value indicating whether the repository has an associated wiki with additional documentation and information |
| Has.Pages | A boolean value indicating whether the repository has GitHub Pages enabled, allowing the creation of a website associated with the repository |
| Has.Discussions | A boolean value indicating whether the repository has GitHub Discussions enabled, allowing community discussions and collaboration |
| Is.Fork | A boolean value indicating whether the repository is a fork of another repository |
| Is.Archived | A boolean value indicating whether the repository is archived. Archived repositories are typically read-only and are no longer actively maintained |
| Is.Template | A boolean value indicating whether the repository is set up as a template |
| Default.Branch | The name of the default branch |
The selected dataset includes variables describing the basic characteristics of each repository. To explore our research questions, we use the number of stars, which reflects the popularity of a repository, as the response variable. As input variables we use the number of forks, the number of issues, the size of the repository (in KB), whether Discussions, Wiki, Pages, and Projects are enabled, and whether the repository is set up as a template, with the goal of finding a model with good predictive power. We chose these input variables because previous research with our selected dataset shows that some of the other variables have little correlation with the popularity of a repository; we exclude those variables to reduce computation time.
The dataset is very large, containing over 215,000 rows (each corresponding to a repository), so we chose to take a stratified random sample of size 1,000 (500 from each of two strata) to use for further data analysis, visualization, and modelling.
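# Contributors: Ellie, Leen, Macy, Yuxi
# Install gridExtra only if it is missing, and do so before attaching it
if (!requireNamespace("gridExtra", quietly = TRUE)) install.packages("gridExtra")
library(broom)
library(repr)
library(infer)
library(gridExtra)
library(faraway)
library(mltools)
library(leaps)
library(glmnet)
library(cowplot)
library(modelr)
library(tidyverse)  # also attaches dplyr, ggplot2, readr, and tidyr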
Loading required package: Matrix
Loaded glmnet 4.1-8
Attaching package: ‘modelr’
The following objects are masked from ‘package:mltools’:
mse, rmse
The following object is masked from ‘package:broom’:
bootstrap
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ modelr::bootstrap() masks broom::bootstrap()
✖ dplyr::combine() masks gridExtra::combine()
✖ tidyr::expand() masks Matrix::expand()
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
✖ modelr::mse() masks mltools::mse()
✖ tidyr::pack() masks Matrix::pack()
✖ tidyr::replace_na() masks mltools::replace_na()
✖ modelr::rmse() masks mltools::rmse()
✖ lubridate::stamp() masks cowplot::stamp()
✖ tidyr::unpack() masks Matrix::unpack()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Updating HTML index of packages in '.Library'
Making 'packages.html' ...
done
We first read in the dataset and select the variables we will use in our analysis. We then take a stratified random sample of the data based on the number of Stars, using two strata: "high" (at or above the median) and "low" (below the median).
# Main developer: Ellie
set.seed(8035)
sample_size <- 500  # rows drawn from each stratum (1,000 in total)
# Read in the dataset (read_csv comes from readr, attached with the tidyverse)
repo <- read_csv("./repositories.csv")
# Median star count, used to split repositories into "high" and "low" strata
stars_med <- median(repo$Stars)
# Remove unused columns, then take a stratified random sample
# grouped by Stars >= median ("high") and Stars < median ("low")
repo_strat_sample <- repo %>%
  select(-Name, -Homepage, -Description, -Watchers, -URL, -`Created At`, -`Updated At`, -Language, -License, -Topics, -`Default Branch`, -`Is Archived`, -`Is Fork`, -`Has Issues`, -`Has Downloads`) %>%
  mutate(no_stars = ifelse(Stars >= stars_med, "high", "low")) %>%
  group_by(no_stars) %>%
  sample_n(size = sample_size, replace = FALSE) %>%
  ungroup() %>%
  select(-no_stars)  # drop the helper stratum label
cat("Table 2: Stratified Sample of Repositories \n")
head(repo_strat_sample)
Rows: 215029 Columns: 24
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): Name, Description, URL, Homepage, Language, License, Topics, Defau...
dbl (5): Size, Stars, Forks, Issues, Watchers
lgl (9): Has Issues, Has Projects, Has Downloads, Has Wiki, Has Pages, Has ...
dttm (2): Created At, Updated At
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Table 2: Stratified Sample of Repositories
| Size | Stars | Forks | Issues | Has Projects | Has Wiki | Has Pages | Has Discussions | Is Template |
|---|---|---|---|---|---|---|---|---|
| <dbl> | <dbl> | <dbl> | <dbl> | <lgl> | <lgl> | <lgl> | <lgl> | <lgl> |
| 28512 | 413 | 119 | 3 | TRUE | TRUE | FALSE | FALSE | FALSE |
| 3627 | 952 | 145 | 132 | TRUE | TRUE | FALSE | FALSE | FALSE |
| 9586 | 1175 | 167 | 29 | TRUE | TRUE | FALSE | FALSE | FALSE |
| 1248 | 848 | 122 | 40 | TRUE | TRUE | FALSE | FALSE | FALSE |
| 6480 | 1283 | 409 | 34 | TRUE | TRUE | TRUE | FALSE | FALSE |
| 3556 | 578 | 104 | 2 | FALSE | FALSE | FALSE | FALSE | FALSE |
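The stratification step can be sanity-checked on a small synthetic data frame. The sketch below is only an illustration: the `toy` data, the `repo_id` and `stratum` columns, and the stratum size of 20 are all hypothetical, and it uses `slice_sample()`, the newer dplyr counterpart of `sample_n()`, rather than the call used in the report.

```r
library(dplyr)

set.seed(42)

# Simulated star counts (NOT taken from the real dataset).
toy <- tibble(
  repo_id = 1:200,
  Stars = round(rlnorm(200, meanlog = 6, sdlog = 1)) + 167
)

# Split at the median, as in the report: "high" is at or above it.
stars_med <- median(toy$Stars)

toy_sample <- toy %>%
  mutate(stratum = if_else(Stars >= stars_med, "high", "low")) %>%
  group_by(stratum) %>%
  slice_sample(n = 20) %>%  # 20 rows per stratum, without replacement
  ungroup()

# Each stratum should contribute exactly 20 rows.
stratum_counts <- count(toy_sample, stratum)
print(stratum_counts)
```

Counting rows per stratum before dropping the helper column is a cheap way to confirm that both strata are equally represented in the sample.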
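cat("Table 3: Summary Statistics of Stars, Forks, and Issues in the full dataset\n")
# Note: these summaries are computed on the full `repo` data, not on the stratified sample
summary(repo$Stars)
summary(repo$Forks)
summary(repo$Issues)
Table 3: Summary Statistics of Stars, Forks, and Issues in the full dataset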
Min. 1st Qu. Median Mean 3rd Qu. Max.
167 237 377 1115 797 374074
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0 39.0 79.0 234.2 174.0 243339.0
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 3.00 10.00 37.92 28.00 26543.00
All the selected continuous variables in our dataset have right-skewed distributions: each mean is far above the corresponding median (for example, Stars has a mean of 1115 but a median of 377). This skewness, together with the extremely large ranges and the presence of extreme outliers, could distort statistical analyses and reduce the interpretability of our data, which indicated the need for a transformation. A log transformation seemed appropriate because it helps stabilize the variance, making the data more homoscedastic, and lessens the impact of extreme values. This makes the modeling process more reliable and helps us better understand the relationships between variables, especially when making predictions.
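As a rough illustration of this reasoning, the sketch below applies a log transformation to simulated right-skewed counts and compares skewness before and after. Everything here is an assumption made for the example: `forks_sim` is synthetic, `skewness()` is a hypothetical helper written in base R, and `log1p()` is shown as one possible way to keep zero counts finite, not as the transformation used in the report.

```r
set.seed(1)

# Synthetic right-skewed counts standing in for a variable such as Forks.
# These values are simulated; they are not from the dataset.
forks_sim <- c(0, 0, round(rlnorm(998, meanlog = 4, sdlog = 1.5)))

# Simple moment-based skewness (hypothetical helper, base R only).
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# log(0) is -Inf, so a plain log produces non-finite values for zero counts.
n_nonfinite <- sum(!is.finite(log(forks_sim)))

# log1p(x) = log(1 + x) keeps zeros finite, since log1p(0) is 0.
forks_log1p <- log1p(forks_sim)

skew_raw <- skewness(forks_sim)
skew_log <- skewness(forks_log1p)

cat("Non-finite values after log():", n_nonfinite, "\n")
cat("Skewness before:", round(skew_raw, 2),
    " after log1p:", round(skew_log, 2), "\n")
```

The count of non-finite values matters in practice: plotting or modelling functions typically drop such rows, sometimes with only a warning.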
# Main developer: Leen
# Contributor: YUXI
options(repr.plot.width = 12, repr.plot.height =5)
p1 <- ggplot(repo, aes(x = log(Size))) +
geom_histogram(bins = 30, fill = "steelblue", alpha = 0.7) +
ggtitle("Histogram of log(Size)") +
xlab("log(Size)") +
ylab("Frequency") +
theme_minimal() +
theme(
plot.title = element_text(size = 16, face = "bold"),
axis.title.x = element_text(size = 14),
axis.title.y = element_text(size = 14),
axis.text.x = element_text(size = 12),
axis.text.y = element_text(size = 12)
)
p1
cat("Figure 1: The histogram of the log(Size) ")
Warning message: “Removed 151 rows containing non-finite outside the scale range (`stat_bin()`).”
Figure 1: The histogram of the log(Size)
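Figure 1 shows the distribution of the continuous variable Size in the full dataset after a log transformation; the transformed distribution is much less skewed. The warning indicates that 151 rows were dropped because log(Size) is not finite for them, which happens when Size is 0 (log(0) is -Inf) or missing.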
# Main developer: Leen
# Contributor: YUXI
#plot 2: scatterplot of stars vs forks, point size represents repository size and is colored for has_discussions
p2 <- repo_strat_sample %>%
ggplot(aes(x = Forks, y = Stars, color = `Has Discussions`, size = Size)) +
geom_point(alpha = 0.6) + # semi transparent points to make it easier to visualize even when overlapping
scale_size(range = c(1, 10)) + # adjusting size of the points based on size of repository
scale_color_manual(values = c("red", "blue")) +
labs(title = "Combo Scatterplot: Forks, Stars, and Repository Size (Colored by Has Discussions)",
x = "(Log) Forks",
y = "(Log) Stars",
size = "Repository Size",
color = "Has Discussions") +
scale_x_log10() +
scale_y_log10() +
theme_minimal()+
theme(
plot.title = element_text(size = 16, face = "bold"),
axis.title.x = element_text(size = 14),
axis.title.y = element_text(size = 14),
axis.text.x = element_text(size = 12),
axis.text.y = element_text(size = 12)
)
p2
cat("Figure 2: A summarized scatterplot of log(Stars) vs log(Forks) ")
Figure 2: A summarized scatterplot of log(Stars) vs log(Forks)
In this combination scatterplot, there appears to be a positive association between (logged) Forks and (logged) Stars, and repositories with discussions enabled (blue) appear more common among those with higher star and fork counts. The point sizes show no clear pattern across Star levels, suggesting a very weak or nonexistent relationship between (logged) Stars and repository Size.
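A visual impression like this can be backed up with a number. The sketch below, on simulated data only, computes the correlation and a simple linear fit on the log scale; the variable names, the true slope of 0.8, and the noise level are arbitrary assumptions and say nothing about the real repositories or the model fitted later in the report.

```r
set.seed(7)
n <- 300

# Simulated repositories: Forks is right-skewed and Stars grows with Forks.
# Adding 1 keeps every fork count positive so that log() stays finite.
forks_sim <- round(rlnorm(n, meanlog = 4, sdlog = 1)) + 1
stars_sim <- exp(2 + 0.8 * log(forks_sim) + rnorm(n, sd = 0.5))

# Pearson correlation between the two variables on the log scale.
log_cor <- cor(log10(forks_sim), log10(stars_sim))

# Simple linear regression on the log scale. The slope is roughly the
# percent change in Stars associated with a 1% change in Forks.
fit <- lm(log(stars_sim) ~ log(forks_sim))
slope <- unname(coef(fit)[2])

cat("Correlation on the log scale:", round(log_cor, 2), "\n")
cat("Estimated log-log slope:", round(slope, 2), "\n")
```

On real data with zero fork counts, the zeros would need to be handled (for example with `log1p()`) before a fit like this is possible.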
# Main developer: Leen
# Contributor: YUXI
repositories <- repo %>%
mutate(across(c(`Has Discussions`, `Has Wiki`, `Is Template`), as.factor))
p3 <- repositories %>%
pivot_longer(cols = c(`Has Discussions`, `Has Wiki`, `Is Template`),
names_to = "Variable",
values_to = "Value") %>%
ggplot(aes(x = Value, y = log(Stars), fill = Variable)) +
geom_boxplot(alpha = 0.7) +
facet_wrap(~ Variable, ncol = 4) +
labs(title = "Boxplots of Stars by Repository Features",
x = "Feature Value",
y = "Log(Stars)") +
theme_minimal() +
theme(legend.position = "none")+
theme(
plot.title = element_text(size = 16, face = "bold"),
axis.title.x = element_text(size = 14),
axis.title.y = element_text(size = 14),
axis.text.x = element_text(size = 12),
axis.text.y = element_text(size = 12)
)
p3
cat("Figure 3: Boxplots Comparing Log(Stars) by Repository Features (Has Discussions, Has Wiki, and Is Template) ")
Figure 3: Boxplots Comparing Log(Stars) by Repository Features (Has Discussions, Has Wiki, and Is Template)
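The centre line and box height of each boxplot correspond to the group median and interquartile range, which can also be tabulated directly. The following is a minimal base R sketch on simulated data; the `toy` data frame and its `has_wiki` and `stars` columns are hypothetical stand-ins, not the report's variables.

```r
set.seed(3)

# Simulated data: a logical feature flag and right-skewed star counts.
toy <- data.frame(
  has_wiki = rep(c(TRUE, FALSE), each = 100),
  stars = round(rlnorm(200, meanlog = 6, sdlog = 1)) + 167
)
toy$log_stars <- log(toy$stars)

# Median and interquartile range of log(stars) within each group.
medians <- tapply(toy$log_stars, toy$has_wiki, median)
iqrs <- tapply(toy$log_stars, toy$has_wiki, IQR)

group_summary <- data.frame(
  has_wiki = names(medians),
  median = as.vector(medians),
  iqr = as.vector(iqrs)
)
print(group_summary)
```

Comparing the difference between group medians with the IQRs gives a quick sense of whether a feature separates the groups or whether the boxes largely overlap.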